Much has been written about evaluating the academic outcomes of Title I
and other compensatory programs, but little attention has been given to
the unique problems of evaluating summer programs. Evaluation designs
developed for use with regular term Title I programs may be inappropriate
or must be adapted for use with summer programs. Perhaps because the
Title I regulations are more flexible and there is no school schedule to
work around, there is great variation in these programs. The short term
nature and limited focus of summer programs create special difficulties
for the evaluator. The characteristics of a particular program can make
certain evaluation designs more appropriate than others.

Since randomized studies are rarely possible, the evaluator must be able
to select a design that controls for alternative explanations of the
results. The purpose of this paper is to suggest evaluation designs that
seem appropriate for summer Title I programs and procedures for
implementing these designs. We will describe various dimensions of
summer programs which affect the choice of an evaluation design, and we
will discuss threats to the validity of summer program evaluations that
are particularly troublesome.

Since this paper presents an overview, it does not provide the procedural
or technical detail necessary to fully implement the recommended designs,
but references are provided for further information. We recognize that
most districts do not have staff trained in evaluation methodology and
have tried to focus on key concepts and practical suggestions.

Characteristics of Summer Programs

After reviewing program descriptions and evaluations of summer Title I
programs from several western states, we have identified sources of
variation in summer programs that are critical in planning an evaluation
of academic achievement. Before a school district adopts a particular
evaluation design, careful consideration of how the nature of the program
affects the outcomes of the evaluation is necessary.

Instruction. A summer program may last anywhere from two to eight
weeks. During that time instruction may be given as much as six hours a
day or as little as a couple of hours a week. This means that while some
programs cover a broad range of objectives, others focus on a few
specific objectives. This range has several implications for selecting a
measure of impact and an evaluation design.

First, it seems unreasonable to expect to detect growth on a standardized
achievement test unless the program has a broad focus and rather
intensive instruction. Most school districts currently use standardized,
norm-referenced achievement tests which sample only a small number of
items in any skill area. With a program having a narrow focus, it is
possible that only one or two items cover what was taught. The
importance of selecting a test that will be sensitive to the instruction
provided by the program cannot be stressed too much.

Second, the limited time available for instruction in some programs
precludes spending much time in testing or conducting the evaluation. In
some situations, spring and fall test data can be used rather than
administering additional tests. In other situations, tests that are used
for student monitoring can be used.

Individualization. While many Title I programs attempt to individualize
instruction, the degree of individualization differs considerably.
Sometimes this means individual help while working on exercises. In
other programs each student sets his own pace working through the same
materials. In still others a unique set of activities is prescribed for
each student. Often, individualized work is interspersed with group
activities.


One implication for program evaluation relates to the selection of a test
to measure growth. The extent to which each student receives a very
different treatment determines whether a single pool of items could be
considered an adequate measure of achievement for the group. While a
broad ranged test may be appropriate, often a criterion-referenced
measure tailored to each student would be preferable.

Nonacademic objectives. Many summer programs can be distinguished by
their emphasis on affective outcomes. While there is instruction in
basic skills, high interest materials and supportive activities are used
to improve the students' self esteem or attitudes toward school. In such
a program it is appropriate to supplement or replace achievement measures
with attitudinal scales, observation, or other measures. One would
expect such a program to improve student achievement but perhaps not
immediately.

Student selection criteria. The procedure for selecting students into a
program is a very important consideration in planning an evaluation.
Like other Title I programs, summer projects generally use an informal
process based on teacher referral and perhaps a general cutoff score on a
standardized test. The referral, however, might be based on such
criteria as the student needing assistance in a particular skill area,
in building self-esteem, or in general skill development. Generally, the
students selected can elect not to attend. Often the participants are
primarily students who participated in the regular term program, but
sometimes the criteria for determining a "needy" student are greatly
relaxed so that the district can be sure to fill all slots in the program.


One implication of the selection method is that if the pretest scores are
used to select students, regression to the mean could confound the
treatment effect in some designs by making the program look more
effective than it actually was. Another implication is that, unlike
regular term programs, there is often an excellent opportunity to select
a comparison group of students who are eligible for Title I but who do
not participate. Such a comparison group would have to be selected using
the same criteria as the participating students.
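
The regression artifact described above can be illustrated with a small
calculation. The sketch below assumes a simple linear regression model in
which pretest and posttest have the same mean and standard deviation in
the population; the function name and all figures are hypothetical and
are not taken from any program.

```python
def expected_posttest_without_treatment(pretest_mean, population_mean, correlation):
    """Expected posttest mean for a group selected on its pretest scores,
    assuming no program effect at all.

    Assumption: pretest and posttest share the same population mean and
    standard deviation and are linearly related with the given correlation.
    """
    return population_mean + correlation * (pretest_mean - population_mean)


# Hypothetical figures: students selected for low pretest scores (mean 30),
# a population mean of 50, and a pretest-posttest correlation of 0.8.
pretest_mean = 30.0
expected = expected_posttest_without_treatment(pretest_mean, 50.0, 0.8)
spurious_gain = expected - pretest_mean

print(f"Expected posttest mean with no program: {expected:.1f}")
print(f"Apparent gain due to regression alone:  {spurious_gain:.1f}")
```

Under these assumed figures the group appears to gain several points even
though nothing was taught, which is why the same scores should not serve
as both selection measure and pretest.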

Other instruction. Since the evaluator wants to distinguish between
growth due to the summer program and growth due to other factors, it is
important to consider influences outside the program. No other formal
instruction is likely during the summer so it may seem reasonable to
expect the same or worse performance at posttest time without the summer
program. However, the student still reads and applies math skills
throughout the summer. Growth should be expected on certain skills, due
to practice, informal instruction by parents, or maturation. It appears
that "summer dropoff" involves slowing, but not stopping, the student's
rate of growth, at least as measured with standardized achievement tests
administered spring and fall ([author illegible], 1980; Stenner & Bland,
1979). At any rate, a comparison group or other controls may be necessary
to distinguish growth attributable to the program.

Relationship to regular term program. The similarities in the
objectives, materials, and students between certain summer programs and
their regular term counterparts open up the possibility of integrating
the evaluations of the two components. When the summer activities are
simply an extension of the regular term, there may be little value in
evaluating them separately.

Evaluation Designs for Summer Programs

Educators have a long tradition of using both a pretest and posttest to
determine how much growth occurred during the educational program.
However, since many of the evaluations conducted have failed to estimate
how much would have been learned without the program, the results of
these studies have often been uninterpretable.


Designing an evaluation that will yield interpretable results is like the
criminal lawyer preparing a court case. He develops a theory of how the
crime took place by looking for evidence that eliminates as many
alternative explanations as possible. In program evaluation, the
evaluator makes assumptions that lead to estimates of how students would
have performed without the program, and collects data in a way that makes
alternative explanations of the results less plausible. The effects of
the program must be separated from those of maturation, other classroom
instruction, informal instruction in the home, testing problems,
characteristics of the particular group selected, and so on.


Three evaluation designs are suggested here. Each makes different
assumptions about student growth that allow the evaluator to estimate how
students would have performed without the Title I instruction. Because
each assumption may be tenable only for summer programs having certain
characteristics, none of the designs is universally applicable.

In describing each design, we have tried to suggest procedural guidelines
and potential variations. We suggest what types of programs can best be
evaluated by each design and list advantages and disadvantages of each
design.

Norm-Referenced Design 

The norm-referenced design compares the growth of the Title I students
with the growth that would have been expected without the Title I
instruction. This no-treatment expectation is determined by the posttest
performance of students in the norming sample of a standardized test
whose pretest status was the same as the Title I group. At posttest time
the expected achievement is subtracted from the actual achievement to
determine the effect of the program. The design is used in essentially
the same way as in evaluating regular term Title I programs (Tallmadge &
Wood, 1980).

Requirements for Use


The norm-referenced design seems best suited to summer programs which
provide intensive instruction on a broad range of objectives. Tests that
are normed typically sample a very small number of items from each skill
area so that achievement of a wide range of objectives can be assessed.
It is unlikely that such a test would be sensitive to instruction that
focused on only a few objectives or that was quite short in duration.
The Title I program should provide three or more hours a day for six to
eight weeks.


The basic assumption of the norm-referenced model is that without
treatment students will tend to remain at the same percentile rank.
(Note, however, that since the scores of an individual student fluctuate
too much, the assumption is based on the average of the group.) If
students in the local population are not similar to those in the norming
population or if the school curriculum differs from that of the schools
included in the sample, this "equipercentile" assumption may not hold and
the estimate of treatment effect may not be accurate.


There are several ways to check this assumption. Check the description
of the norming sample in the technical manuals of the test to make sure
that students like those in the local population were included in the
norming. If historical data are available on the local students, one
could check to see if, on the average, students in the district maintain
their percentile ranks prior to Title I instruction.


For evaluating summer programs, the design also assumes that little or no
supplementary instruction occurred between pretest and posttest other
than the summer program. Since the test must generally be given in the
spring and fall at norm dates, some students might receive several weeks
of regular term Title I instruction between the pretest and posttest,
thus confounding the summer evaluation. On the other hand, if the
regular term program starts late and ends early in the school year, there
may not be a problem.

The design is also well suited for the program which extends the
materials and objectives of the regular term Title I program with the
same students. In some cases it may not be worthwhile to evaluate the
two programs separately so that the focus of the evaluation would be
their combined effects. If many regular term students do not participate
during the summer, gains could be computed separately for the regular
term only and the regular plus summer participants.



Procedural Guidelines


The guidelines for the norm-referenced design are detailed in the User's
Guide (Tallmadge & Wood, 1980) for the Title I Evaluation and Reporting
System. The following are some highlights as they apply to summer
programs.

Select an appropriate standardized test. The test must have empirical
national or local norms for the spring and fall. These norms should be
based on a large sample of students which is representative of the local
group of students. The test should reflect what is taught in the summer
program and should contain few objectives that are not covered. It would
be difficult to find a suitable norm-referenced test for short summer
programs that focus on a limited set of objectives.


Select students. Students most in need of supplementary instruction
should be selected for the program. This may be done using a variety of
methods as long as the pretest scores were not used in any way in the
selection process. This is to avoid regression to the mean. Since
participation in the summer program is usually voluntary, statistical
adjustments for regression that require a strict pretest cutoff for
selection cannot be used.


Pretest and posttest students. The spring pretest and fall posttest
should be administered within two weeks of the empirical norm dates of
the test and the instructions for administering the test must be
carefully followed. If the test cannot be administered close to the norm
dates but within six weeks, it is possible to interpolate the norms.
Students should be tested at their functional level since the test level
recommended for each grade level may be too difficult for Title I
students. An attempt should be made to use the form of the test used by
the publisher in their norming study. When possible, the same form and
level of the test should be used for both pretest and posttest.

Compute the treatment effects. Only the data from students having both a
pretest and posttest score are used in the analysis. The pretest and
posttest scores are converted to NCEs or expanded standard scores and
averaged. The effect of the program is the average posttest NCE minus
the average pretest NCE.
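
The computation just described can be sketched in a few lines. The
example assumes that each student's scores are available as national
percentile ranks and converts them with the standard NCE scaling
(mean 50, standard deviation 21.06); the function names and the sample
records are hypothetical.

```python
from statistics import NormalDist, mean


def percentile_to_nce(percentile):
    """Convert a percentile rank (1 to 99) to a normal curve equivalent."""
    z = NormalDist().inv_cdf(percentile / 100.0)
    return 50.0 + 21.06 * z


def nce_gain(records):
    """Average posttest NCE minus average pretest NCE.

    `records` is a list of (pretest_percentile, posttest_percentile) pairs,
    with None marking a missing score. Only students with both scores are
    used, as the guideline requires.
    """
    complete = [(pre, post) for pre, post in records
                if pre is not None and post is not None]
    pre_mean = mean(percentile_to_nce(pre) for pre, _ in complete)
    post_mean = mean(percentile_to_nce(post) for _, post in complete)
    return post_mean - pre_mean, len(complete)


# Hypothetical records; the third student missed the posttest and is dropped.
records = [(20, 25), (30, 30), (10, None), (15, 22)]
gain, n_students = nce_gain(records)
print(f"NCE gain of {gain:.1f} based on {n_students} students")
```

Scores are converted before averaging because percentile ranks are not
an equal-interval scale, whereas NCEs are intended to be.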


Advantages and Disadvantages

The primary advantage of the norm-referenced design is its simplicity and
familiarity to school district staff. The procedures are not difficult
to implement and should be familiar to those who have evaluated regular
term programs. Another advantage is that no additional testing may be
required if the district has already administered an appropriate test as
part of a district testing program or other evaluation requirements.

The design has several disadvantages. First, the assumption that the
Title I students would achieve at the same rate as students in the
norming sample having the same initial status may not be accurate for the
local population. Second, regular term Title I instruction that occurs
after the spring pretest or before the fall posttest will bias the
estimate of the gains. Third, the rather long time span between pretest
and posttest allows many opportunities for learning (and forgetting) to
occur due to factors other than the program. Fourth, the test selected
will probably be sensitive to only rather intensive summer instruction.


Criterion-Referenced Design

A criterion-referenced design determines the progress of Title I students
with respect to a specified criterion or standard. The criterion or
standard may be defined in terms of some minimum number of test items
answered correctly or in terms of some level of performance. In either
case the measure of achievement is referenced directly to well defined
instructional objectives. The effect of the program is determined by
comparing the posttest performance of students with their pretest
performance.

Requirements for Use 

The criterion-referenced design seems best suited to summer programs
which are based on carefully identified objectives, especially when very
different objectives are prescribed for each student. Often these will
be very short programs which provide help to students in a fairly limited
number of skill areas or objectives.

This design is essentially a pre-post design on a single group of
students and lacks a procedure for determining whether the observed
growth is due to the program or to other factors. The assumption is that
in the absence of the Title I instruction, students would not have
improved on the posttest. This seems like a rather strong assumption,
yet there do seem to be situations in which the assumption might be
reasonable. Unlike regular term programs, the student is not receiving
any formal instruction in addition to Title I. If there is only a short
time interval between the pretest and posttest, it is unlikely that much
growth would be expected due to maturation, informal instruction in the
home, television programs, or similar influences. Also, the development
of some skills seems to be more dependent on direct instruction than
others. For example, improvement on a specific list of vocabulary words
can be more easily attributed to the program than can improvements in
general comprehension objectives. Mathematics tests appear to be more
sensitive than reading tests. A large district might be able to plan a
study to determine whether any growth tends to occur during the interval
between pretest and posttest without direct instruction.

One point of clarification should be made at this point. The
criterion-referenced design is not equivalent to criterion-referenced
testing. The design may incorporate criterion-referenced tests but those
same tests could be used in either of the other designs as well.

Procedural Guidelines

The criterion-referenced model is very flexible and there are many
possible variations in the way the model might be applied. Some basic
guidelines are:



Identify clear instructional objectives. The critical feature of this
design is that student achievement is referenced to a domain or set of
behaviors. If the objectives are clearly defined, one can easily
distinguish between test items or tasks that do or do not measure that
objective. Knowing the student's performance on the test, one will know
which domain or skill areas the student has mastered and which areas are
weak. Some types of objectives (see Nitko, 1980, for a review) are:

1. Specific skill or knowledge such as "addition of three place
   numbers with carrying"

2. A continuum of skill complexity like "addition of two single
   digit numbers without carrying" through to "addition of three
   3-digit numbers with carrying"

3. Proficiency from novice to expert in a skill such as composition

Objectives can be developed either before or after students are selected
into the program. If after, then objectives and criteria for successful
completion of the objectives could be individualized.

Select a sample of items to measure each objective or skill domain. A
valid and reliable measure of each objective or skill domain to be
evaluated is needed. The options include:

1. A commercially available criterion-referenced or diagnostic test
   that matches the curriculum

2. A test developed to parallel the program materials

3. A customized test developed from an item bank or by a district
   testing office

4. A skills checklist that can be administered reliably

Some guidelines for developing or evaluating criterion-referenced tests
are available (e.g., Hambleton & Eignor, 1978; Popham, 1978), but the
field is complicated by the diversity in types of these tests.



Establish performance criteria. If a mastery approach is taken, a
performance criterion should be set up for each skill so that one can
determine when an objective has been reached. These criteria could be in
relation to one of the following:

1. Number of items correct on each cluster of items

2. Proportion of individually prescribed objectives mastered

3. Proportion of students mastering each objective or meeting each
   criterion

4. Score indicating degree of proficiency or level of task
   complexity



Select program participants. The most needy students are to be selected
into the summer program. This can be done using a variety of procedures
as long as the pretest scores are not used in any way in the selection
process. Other test scores, teacher observation, progress through a
reading series, and other measures can be used.

Pretest and posttest participating students. The nature of the objectives
and criteria for success determine when pre and posttesting can occur.
In some cases, the same measure is given just before and just after the
relevant unit of instruction. In other cases testing would occur at the
beginning and end of the program. Note that the test items completed may
be different for each student if each student receives instruction on
different objectives.


Compute the growth in achievement. The difference in achievement between
the pretest and posttest can be expressed in the manner specified by the
criteria for success established above.
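
As one illustration, growth can be expressed as the change in the
proportion of individually prescribed objectives mastered. The sketch
below is only one of the possible variations; the objective names, the
criterion values, and the function name are hypothetical.

```python
def proportion_mastered(scores, criteria):
    """Fraction of a student's prescribed objectives whose criterion is met.

    `criteria` maps each prescribed objective to the minimum number of
    items correct that counts as mastery; `scores` maps objectives to the
    number of items the student answered correctly.
    """
    met = sum(1 for objective, needed in criteria.items()
              if scores.get(objective, 0) >= needed)
    return met / len(criteria)


# Hypothetical prescription for one student: three objectives, each with
# its own mastery criterion.
criteria = {"vowel_sounds": 8, "consonant_blends": 8, "sight_words": 16}
pretest = {"vowel_sounds": 3, "consonant_blends": 8, "sight_words": 9}
posttest = {"vowel_sounds": 9, "consonant_blends": 10, "sight_words": 15}

pre = proportion_mastered(pretest, criteria)
post = proportion_mastered(posttest, criteria)
growth = post - pre
print(f"Objectives mastered: {pre:.2f} at pretest, {post:.2f} at posttest")
```

Because each student can carry a different `criteria` mapping, the same
calculation works when objectives are individualized.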

Advantages and Disadvantages 

The main advantage of the design is its flexibility. The design can be
adapted to many types of programs. Suitable measures for student
monitoring may already be incorporated in the program. There are few
constraints on when testing should take place and there is no comparison
group. Another advantage is that this design can yield timely information
that is tied directly to instruction.

A major disadvantage of the design is that the design is weak.
Differences in performance may be due to causes other than the Title I
program. Its flexibility may conceal measurement problems or ill-defined
objectives. If the design is not carefully integrated into the
instructional program, much time can be wasted in testing and
recordkeeping. The test information will be most useful if tied directly
to the curriculum.


Comparison Group Design


The comparison group design is implemented by establishing two groups of
students that are similar in all respects except that one group
participates in the summer program and the other does not. Both groups
are pretested and posttested under the same conditions and at the same
time. The relative progress of the treatment group over the comparison
group yields an estimate of the effect of the program. The design is
used in the same way as for evaluating other Title I programs (Tallmadge
& Wood, 1980).

Requirements for Use

The assumption behind the design is that the comparison group is an
adequate control for alternative explanations of the evaluation results
so that any differences in the progress of the groups over the summer can
be attributed to the Title I program. In order to ensure the adequacy of
this control the two groups need to be very similar on all educationally
important variables such as race, gender, socioeconomic status, and
pretest status.



The ideal way to ensure equivalent groups is to randomly assign students
to the treatment and no-treatment conditions. Since randomization is
rarely possible in field settings, approximations to true randomized
selection must be used. The best way to proceed is to select a
comparison group that can be logically assumed to be equivalent to the
Title I group and then compare the two groups on educationally relevant
variables to establish that, in fact, they are similar. Finding a
comparison group for a summer program probably is easier than finding one
for a regular term program.


Procedural Guidelines

The User's Guide (Tallmadge & Pagan, 1980) describes the procedure in
more detail, but some highlights that pertain to summer programs follow.


Select a test. The test chosen should measure the summer program
objectives. The test used need not have norms. However, the test chosen
must still be reliable and valid for measuring progress of students.

Select Title I and comparison students. Students in the treatment and
comparison groups must be selected in the same manner in order to avoid
differential regression between the groups. Some options for selecting a
comparison group are:

• Select a pool of students eligible for the summer program.
  Those students who do not elect to participate would serve as
  the comparison group. Consider whether the elective nature of
  this selection method produces a nonparticipating group that is
  different in educationally relevant ways from the Title I group.

• The comparison group students might be selected from another
  school not participating in the summer program using the same
  objective procedure or criterion used in the participating
  school.


Students must be selected who have the same educational experiences
between the pretest and posttest as the treatment group except
participation in the summer program. Since testing must generally occur
in the spring and fall, consideration must be given to whether students
are receiving any different regular term Title I instruction during that
period. Check the similarity of the Title I and comparison groups.

Pre and posttest. The treatment and comparison students must be tested
at the same time under the same conditions. Testing need not occur on
the empirical norm dates if a norm-referenced test is used. Since it is
rarely possible to test the comparison group students during the summer,
testing must usually occur during the spring and fall.


Compute the results. Include in the analyses only those treatment and
comparison students that have both pre and posttest scores. The decision
of what analysis to use is complex and controversial. Professional
judgment is needed. When students are randomly assigned (or when
assignment is random in effect) it is possible to use analysis of
covariance to adjust for any random differences between the pretest
status of the two groups. When groups are assigned based on a stable group
difference such as volunteering for the program, as is typical in this
application, Tallmadge & Wood (1980) suggest a principal axis adjustment
(see also Kenny, 1975).
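
Before any adjustment is applied, the basic comparison is simply the
difference between the two groups' average gains. The sketch below shows
only that unadjusted starting point; it is not the analysis of covariance
or the principal axis adjustment mentioned above, and the function names
and scores are hypothetical.

```python
from statistics import mean


def mean_gain(pairs):
    """Average posttest-minus-pretest gain for students with both scores.

    `pairs` is a list of (pretest, posttest) scores; None marks a missing
    score, and students missing either score are excluded.
    """
    gains = [post - pre for pre, post in pairs
             if pre is not None and post is not None]
    return mean(gains)


def unadjusted_program_effect(treatment, comparison):
    """Treatment group gain minus comparison group gain, with no
    statistical adjustment for pretest differences between the groups."""
    return mean_gain(treatment) - mean_gain(comparison)


# Hypothetical spring and fall scores on the same test for both groups.
treatment = [(31, 36), (28, 30), (35, None), (40, 43)]
comparison = [(30, 31), (29, 28), (38, 39), (None, 33)]

effect = unadjusted_program_effect(treatment, comparison)
print(f"Unadjusted difference in mean gains: {effect:.2f}")
```

If the two groups differ noticeably at pretest, this raw difference
should be treated as a first look only, and one of the adjustments named
in the text considered.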
Advantages and Disadvantages




The major advantage to this design is that it provides the best estimate
of the effect of the summer program on achievement, if an adequate
comparison group is used. Finding an adequate comparison group is easier
for summer programs than for other Title I programs. Regular testing
data can also usually be used, so that no extra testing is required.

The major disadvantage is that the analysis of the data requires more
sophistication if statistical adjustments are used. Also, it may be
difficult to obtain a comparison group unless testing is done in the
spring and fall.



Discussion

In this paper we have proposed three evaluation designs that seem
appropriate for summer Title I programs. We included only designs which
seemed relatively easy to implement and which would yield reasonably
valid conclusions when implemented properly. There will be situations,
however, when none of the suggested designs can be implemented without
violating assumptions or guidelines. For example, a brief program with
poorly defined objectives and no suitable testing materials would have
difficulty implementing the criterion-referenced or norm-referenced
design. The summer program using a norm-referenced test for which the
empirical norm dates overlap with the regular term program will confound
the effects of the two programs.






Regardless of the design selected to evaluate summer programs, there are
several issues that should be considered before implementing the
evaluation.

Match between content of test and instruction. The match between what is
taught and what is assessed is a very important feature of an
evaluation. While this may seem intuitively obvious, it is probably the
evaluation guideline most frequently disregarded. Too often, a summer
program with a definite focus will be evaluated using a broad
standardized achievement test. A total reading score, for instance,
should not be considered a reasonable measure of student achievement in
a 2-week summer program working exclusively on vowel sounds. It is
unlikely that more than a couple of the test items would reflect what was
taught.



If a test is an appropriate tool to evaluate a particular program, at
least four conditions should hold: (1) the test should measure most of
the instructional objectives; (2) the number of items in each skill area
should be roughly proportional to the relative emphasis of that skill in
instruction; (3) there should be few items on the test that measure
objectives that were not covered during the program; and (4) the tasks or
item formats should be familiar to the students. Each of these criteria
should be applied in evaluating the test or subtest selected for the
evaluation.
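
The first three conditions lend themselves to a simple tally, sketched
below. The example assumes that each test item has been classified by
objective and that instructional emphasis is expressed as shares summing
to one; the function name, objective labels, and item counts are
hypothetical. The fourth condition, familiarity of item formats, requires
judgment and is not computed.

```python
def summarize_test_match(test_items, taught_emphasis):
    """Summarize how well a test matches what was taught.

    `test_items` maps each objective to its number of items on the test.
    `taught_emphasis` maps each taught objective to its share of
    instruction (shares are assumed to sum to 1).
    """
    total_items = sum(test_items.values())
    taught_counts = {obj: test_items.get(obj, 0) for obj in taught_emphasis}
    n_taught_items = sum(taught_counts.values())

    # Condition 1: share of taught objectives with at least one item.
    covered = sum(1 for n in taught_counts.values() if n > 0)

    # Condition 2: largest gap between an objective's share of the relevant
    # items and its share of instruction (None if no item is relevant).
    gaps = [abs(taught_counts[obj] / n_taught_items - share)
            for obj, share in taught_emphasis.items()] if n_taught_items else []

    return {
        "objectives_covered": covered / len(taught_emphasis),
        "largest_emphasis_gap": max(gaps) if gaps else None,
        # Condition 3: share of test items on objectives never taught.
        "untaught_item_share": (total_items - n_taught_items) / total_items,
    }


# Hypothetical broad reading test applied to a program on vowel sounds only.
test_items = {"vowel_sounds": 2, "comprehension": 30, "vocabulary": 18}
summary = summarize_test_match(test_items, {"vowel_sounds": 1.0})
print(summary)
```

In this assumed case the single taught objective is covered, yet nearly
all of the items measure skills outside the program, which is the
mismatch the vowel sounds example warns against.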

Number of students. To base decisions on evaluation results, one would
like to be confident that the observed evaluation results are due to
actual program effects rather than random fluctuation due to measurement
error or to the particular sample of students participating. Unless data
from a substantial number of students were used in computing achievement
gains, error can easily obscure program effects. Since few school
districts have the resources to apply statistical significance tests to
their evaluation data, the evaluation results are often used without
regard to the number of students participating in the program.

Typically, summer programs serve only a small number of students. Even
if there is little turnover for the duration of the project, estimates of
program impact will generally be based on a limited number of pre- and
posttest scores.

Since the number of students served by a program cannot be increased
simply to improve the evaluation, other methods must be sought to
increase the stability of the results. The accepted methods are to
follow through in make-up testing, to aggregate results, and to watch for
trends.

When the scores of only a few program participants are available,
substantial improvements in the stability of group means can be realized
by ensuring that both pre- and posttest scores will be available from as
many students as possible. This requires make-up testing. It may also
require coordination with other schools to obtain scores of students who
have transferred within the district or graduated.


Aggregating scores across school buildings, across years of the program,
or across grades is a very effective method of increasing the stability
of evaluation results. When the program has been implemented in a
similar way across one or more of these dimensions, better estimates of
program effectiveness can be computed from the combined scores. For
example, a district offering a small program of five students in both
third and fourth grades might average the scores from both grades and
from two years of the program, yielding a single gain based on about 20
students. Care must be taken, though, to combine scores from different
grades only if NCEs are used. Care must also be taken in interpreting
the results when such combinations are used.
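The aggregation in this example can be sketched as follows. The NCE gains below are invented, and the function name and group labels are hypothetical; the sketch simply pools student-level gains from two grades and two years so that one mean gain rests on all 20 students.

```python
import statistics


def pool_gains(groups):
    """Pool student-level NCE gains across groups.

    groups: label -> list of NCE gains (posttest NCE minus pretest NCE).
    Returns (pooled number of students, pooled mean gain, per-group means).
    """
    pooled = [gain for gains in groups.values() for gain in gains]
    group_means = {label: statistics.mean(gains) for label, gains in groups.items()}
    return len(pooled), statistics.mean(pooled), group_means


# Hypothetical NCE gains: five students per grade in each of two years.
nce_gains = {
    ("grade 3", "year 1"): [3, -1, 6, 2, 5],
    ("grade 4", "year 1"): [0, 4, -2, 7, 1],
    ("grade 3", "year 2"): [5, 1, 3, -3, 4],
    ("grade 4", "year 2"): [2, 6, 0, 3, -1],
}

pooled_n, pooled_mean, group_means = pool_gains(nce_gains)
print(f"pooled n = {pooled_n}, pooled mean NCE gain = {pooled_mean}")
for label, mean in group_means.items():
    print(label, mean)
```

Pooling at the student level weights each group by its size. It is reasonable only when the program was implemented in a similar way in every group being combined.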



Quality control. The effects of summer Title I programs that can be
detected with any of the evaluation designs are likely to be small. If
the evaluation is not done carefully, the resulting error can obscure the
actual treatment effect. Experience with evaluations of regular term
Title I programs suggests that some of the more common sources of error
are:

• Failure to follow the guidelines for the evaluation design
• Lack of match between the content of the test and what was taught
• Improper test administration
• Recordkeeping, scoring, and score conversions

Due to the brief nature of many summer programs, teachers are
understandably reluctant to use instructional time for testing students
or planning time for scoring those tests. We have suggested above
several ways to make maximum use of scores that may have been collected
for other purposes. It may be possible to use spring and fall scores
from a districtwide testing program or from the evaluation of the
regular term Title I program for the evaluation. The norm-referenced
design would be implemented most efficiently in this way. Student
monitoring during the program using items referenced to the curriculum
can often be used for evaluation purposes, particularly with the
criterion-referenced design.

Other forms of evaluation. Measuring the academic achievement of summer
Title I students is not the only form of evaluation that can provide
useful information about the effectiveness of the program. It may not
even be the most efficient or useful for this type of program. There are
other outcomes to consider, particularly with programs which emphasize
attitudes or self-concept. Process or implementation evaluations can
provide information about how well the program is functioning from
another perspective or about the extent to which the program was
implemented. Often, the results of such studies translate more easily
into program decisions than application of the designs discussed here.








References 



Hambleton, R.K., & Eignor, D.R. Guidelines for evaluating criterion-
    referenced tests and test manuals. Journal of Educational
    Measurement, 1978, 15, 321-327.

Kenny, D.A. A quasi-experimental approach to assessing treatment
    effects in the nonequivalent control group design. Psychological
    Bulletin, 1975, 82, 345-362.

Nitko, A.J. Distinguishing the many varieties of criterion-referenced
    tests. Review of Educational Research, 1980, 50, 461-485.

Popham, W.J. Criterion-referenced measurement. Englewood Cliffs, NJ:
    Prentice-Hall, 1978.

Roberts, S.J. Cognitive growth over the summer. RMC Research
    Corporation, February, 1980.

Stenner, A.J., & Bland, J.D. Assumptions about summer growth implicit
    in norm-referenced achievement tests. Paper presented to the annual
    meeting of the American Educational Research Association, San
    Francisco, CA, April, 1979.



Tallmadge, G.K., & Wood, C.T. User's guide: ESEA Title I evaluation and
    reporting system. RMC Research Corporation, Mountain View, CA, 1980.



